Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.7 - Equality Statements in Spark and Comparison with Nulls")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
 | id | breed_id | nickname | birthday | age | color |
---|---|---|---|---|---|---|
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None |
2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white |
Filtering Data
When you want to filter data with more than one expression, there are a couple of gotchas that you will need to be careful of.
Case 1: Multiple Conditions
(
    pets
    .where(
        (F.col('breed_id') == 1) &
        (F.col('color') == 'brown') &
        (F.col('color').isin('brown'))
    )
    .toPandas()
)
 | id | breed_id | nickname | birthday | age | color |
---|---|---|---|---|---|---|
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
What Happened?
When there is more than one condition, you will need to wrap each condition in () brackets, and use Python's bitwise operators (&, |, ~) instead of the logical operators (and, or, not).
Why?
This is because Python provides no way to overload the logical operators and, or and not: they work through __bool__, which must return a plain True or False. Spark's Column therefore raises an error when truth-tested, and the Spark developers overloaded the bitwise operators &, | and ~ instead to build column expressions.
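To make both rules concrete, here is a small sketch of the failure modes, with the broken variants left commented out:

# 1. Python's `and` truth-tests the left Column, and bool() on a Column
#    raises "ValueError: Cannot convert column into bool ...".
# pets.where((F.col('breed_id') == 1) and (F.col('color') == 'brown'))

# 2. Without brackets, `&` binds tighter than `==`, so this parses as
#    F.col('breed_id') == (1 & F.col('color')) == 'brown' and fails too.
# pets.where(F.col('breed_id') == 1 & F.col('color') == 'brown')

# Correct: brackets around each condition, `&` between them.
pets.where((F.col('breed_id') == 1) & (F.col('color') == 'brown'))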
Case 2: Nested Conditions
(
    pets
    .where(
        (
            F.col('breed_id').isin([1, 2]) &
            F.col('breed_id').isNotNull()
        ) |
        (F.col('color') == 'white')
    )
    .toPandas()
)
 | id | breed_id | nickname | birthday | age | color |
---|---|---|---|---|---|---|
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
1 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
2 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white |
What Happened?
Similar to before, nested conditions need to be wrapped in () brackets as well.
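As a side note, where() is an alias of filter() and also accepts a SQL expression string, so if the brackets get hard to read, the same nested filter can be sketched as:

(
    pets
    # Same logic as above expressed as one SQL string, which
    # sidesteps the bracket and bitwise-operator rules entirely.
    .where("(breed_id IN (1, 2) AND breed_id IS NOT NULL) OR color = 'white'")
    .toPandas()
)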
Case 3: Equality Statements with Null Values (use isNotNull() and isNull())
(
    pets
    .withColumn('result', F.col('color') != 'white')
    .withColumn(
        'result_2',
        (F.col('color') != 'white') &
        (F.col('color').isNotNull())
    )
    .toPandas()
)
 | id | breed_id | nickname | birthday | age | color | result | result_2 |
---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | True | True |
1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | None | False |
2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | None | False |
3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white | False | False |
What Happened?
If you do not come from a SQL background, this can catch you off guard: any comparison with a Null value evaluates to Null rather than True or False, unless you specifically use the Null comparisons. The two Null comparisons are isNotNull() and isNull(). This is why result is Null for Argus and Chewie, while result_2, which guards against Null explicitly, is False.
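On a related note, Spark also provides a null-safe equality comparison, Column.eqNullSafe() (available in Spark 2.3+), which never returns Null; a minimal sketch:

(
    pets
    # eqNullSafe treats Null as an ordinary value:
    # Null <=> 'white' is False and Null <=> Null is True,
    # so is_white is always True or False, never Null.
    .withColumn('is_white', F.col('color').eqNullSafe('white'))
    .toPandas()
)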
Summary
In Spark, when using a more involved conditional expression, you will need to wrap each condition in () brackets and use the bitwise operators in Python. Be explicit when you're performing conditional transformations on columns that can be Null.